Data Visualization of Lunch Form¶

image.png

Goal of the notebook: I will use Bokeh to visualize data I have collected. The data is a CSV file containing questions generated from a lunch form and the answers that different question-answering models gave to them. I will try out different plots to see which ones best suit the audience I am presenting the results to.

Side note: Later in the notebook I switch from Bokeh to the Plotly visualization tool. I switched because Plotly is easier to use, and it is also a tool that is new to me.

Background Information of the Data¶

General flow of how the data is created¶

flow-diagram-general-flow.PNG

General concept of how question-answering models work¶

question_answering_flow.PNG

Context used for the data¶

context_validation.PNG

1. Importing Libraries¶

In [3]:
import pandas as pd
import os  
import numpy as np

# pi is needed to convert label shares into wedge angles for the pie chart
from math import pi

from bokeh.palettes import Category20c
from bokeh.transform import cumsum, factor_cmap
from bokeh.plotting import figure, output_notebook, show

# squarify computes the rectangle layout for the Bokeh treemap
from squarify import normalize_sizes, squarify

import plotly.express as px
import plotly
from dash import Dash, dcc, html, Input, Output

2. Importing the Merged DataFrame¶

The data comes from CSV files created by running the generated questions through different question-answering models. The merged DataFrame combines the answers that each model predicted for each label of the form.

In [30]:
df = pd.read_csv(r'C:\Users\victo\source\repos\Semester 7\JupyterLab\Group\Question Generator\csv_ouput\df_merged.csv', index_col=[0])
# If the saved index shows up as an 'Unnamed: 0' column instead, drop it by name:
# df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()
Out[30]:
   label                questions                   answer  score  model               percentage  actual_answer  correctly_predicted  occurence
0  Number of Attendees  what Number of Attendees?   15      62.77  roberta-base-model  1.539416    15             True                 1
1  Number of Attendees  who Number of Attendees?    15      67.06  roberta-base-model  1.644627    15             True                 1
2  Number of Attendees  where Number of Attendees?  15      46.99  roberta-base-model  1.152416    15             True                 1
3  Number of Attendees  when Number of Attendees?   15      55.51  roberta-base-model  1.361367    15             True                 1
4  Number of Attendees  why Number of Attendees?    15      55.14  roberta-base-model  1.352293    15             True                 1
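
The merged file itself is not built in this notebook. As a rough sketch of how per-model results could be combined into one frame with a `model` column like the one shown above, assuming each model's predictions sit in their own DataFrame (the tiny frames and the `merge_model_results` helper are hypothetical and made up for illustration):

```python
import pandas as pd

def merge_model_results(frames_by_model):
    """Stack per-model prediction frames and tag each row with its model name.

    `frames_by_model` maps a model name to a DataFrame of that model's answers.
    This helper is hypothetical; it is not the code that produced df_merged.csv.
    """
    tagged = [frame.assign(model=name) for name, frame in frames_by_model.items()]
    merged = pd.concat(tagged, ignore_index=True)
    # Assumption: a prediction counts as correct when it matches the expected answer exactly.
    merged["correctly_predicted"] = merged["answer"] == merged["actual_answer"]
    return merged

# Made-up example rows, not taken from the real dataset.
roberta = pd.DataFrame({"label": ["Date"], "answer": ["4 April"], "actual_answer": ["4 April"]})
deberta = pd.DataFrame({"label": ["Date"], "answer": ["noon"], "actual_answer": ["4 April"]})

merged = merge_model_results({"roberta-base-model": roberta, "deberta": deberta})
print(merged[["label", "model", "correctly_predicted"]])
```

Keeping the model name as a regular column is what later makes it possible to group, color, and nest the plots by model.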

3. Exploratory Data Analysis using Bokeh¶

In this chapter I will be exploring the data using Bokeh.

3.1. PieCharts¶

I will visualize how often each form label occurs in the dataset.

In [31]:
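# Count how often each form label occurs
x = df.label.value_counts()
# rename_axis names the index before reset_index, so the column is called
# 'country' on any pandas version (the name matches the tooltip and legend used below)
data = x.rename_axis('country').reset_index(name='value')
# Each label's share of the rows, expressed as an angle of the full circle
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Category20c[len(x)]
data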
Out[31]:
   country              value  angle     color
0  Number of Attendees  59     1.256637  #3182bd
1  Date                 55     1.171441  #6baed6
2  End Time             43     0.915854  #9ecae1
3  Start Time           41     0.873256  #c6dbef
4  Budget               27     0.575071  #e6550d
5  Contact Details      24     0.511174  #fd8d3c
6  Organizer            20     0.425979  #fdae6b
7  Location             10     0.212989  #fdd0a2
8  Food Allergies       9      0.191690  #31a354
9  Food Diets           7      0.149093  #74c476
In [32]:
p = figure(height=350, title="Pie Chart", toolbar_location=None,
           tools="hover", tooltips="@country: @value", x_range=(-0.5, 1.0))

p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='country', source=data)

p.axis.axis_label = None
p.axis.visible = False
p.grid.grid_line_color = None
output_notebook()
show(p)
Loading BokehJS ...

A little background on the data: it consists of questions created for a lunch form. The form has several kinds of labels, which we will look at more closely in a bit, and the pie chart above shows how the rows are distributed across those labels.
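
To make the wedge sizes less of a black box, here is a small sketch of the arithmetic behind the `angle` column, using only the label counts from the table above. Each count is turned into its share of a full circle (2π radians), and each wedge starts where the previous one ends, which is what the cumulative sum in the pie chart cell expresses. The variable names are my own.

```python
from math import pi, isclose

# Label counts copied from the value_counts() output above.
counts = {"Number of Attendees": 59, "Date": 55, "End Time": 43, "Start Time": 41,
          "Budget": 27, "Contact Details": 24, "Organizer": 20, "Location": 10,
          "Food Allergies": 9, "Food Diets": 7}

total = sum(counts.values())
# Share of the full circle (2*pi radians) for every label.
angles = {label: n / total * 2 * pi for label, n in counts.items()}

# Running start/end angles: each wedge starts where the previous one ended.
start = 0.0
wedges = {}
for label, angle in angles.items():
    wedges[label] = (start, start + angle)
    start += angle

print(round(angles["Number of Attendees"], 6))
print(isclose(start, 2 * pi))  # the wedges together close the circle
```

The first printed value can be compared with the `angle` column in the table above.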

3.2. Treemaps¶

Here I create a treemap by taking Bokeh's treemap example and applying it to the data used in this notebook.

In [33]:
def treemap(df, col, x, y, dx, dy, *, N=100):
    sub_df = df.nlargest(N, col)
    normed = normalize_sizes(sub_df[col], dx, dy)
    blocks = squarify(normed, x, y, dx, dy)
    blocks_df = pd.DataFrame.from_dict(blocks).set_index(sub_df.index)
    return sub_df.join(blocks_df, how='left').reset_index()
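
The helper above does two things: it scales the values so that they add up to the area of the canvas, and then it places one rectangle per value. The sketch below illustrates that idea in plain Python with a deliberately simpler "slice" layout. It is not squarify's algorithm (squarify also tries to keep rectangles close to square), and the function names and sizes are made up for illustration.

```python
def normalize_to_area(sizes, dx, dy):
    """Scale sizes so that they sum to the area dx * dy."""
    total = sum(sizes)
    return [s * dx * dy / total for s in sizes]

def slice_layout(sizes, x, y, dx, dy):
    """Place rectangles side by side along the x axis (simple slice layout).

    This is NOT the squarified algorithm; it only shows how areas
    are turned into rectangle coordinates.
    """
    rects = []
    for area in normalize_to_area(sizes, dx, dy):
        width = area / dy  # every rectangle uses the full height
        rects.append({"x": x, "y": y, "dx": width, "dy": dy})
        x += width
    return rects

# Illustrative sizes on a 100 x 10 canvas.
rects = slice_layout([50, 30, 20], 0, 0, 100, 10)
for rect in rects:
    print(rect)
```

The rectangles come back with the same `x`, `y`, `dx`, `dy` keys that the `treemap` helper joins onto the DataFrame.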
In [34]:
df_correct_prediction = df[df.correctly_predicted != False]
df_correct_prediction.head()
Out[34]:
   label                questions                   answer  score  model               percentage  actual_answer  correctly_predicted  occurence
0  Number of Attendees  what Number of Attendees?   15      62.77  roberta-base-model  1.539416    15             True                 1
1  Number of Attendees  who Number of Attendees?    15      67.06  roberta-base-model  1.644627    15             True                 1
2  Number of Attendees  where Number of Attendees?  15      46.99  roberta-base-model  1.152416    15             True                 1
3  Number of Attendees  when Number of Attendees?   15      55.51  roberta-base-model  1.361367    15             True                 1
4  Number of Attendees  why Number of Attendees?    15      55.14  roberta-base-model  1.352293    15             True                 1
In [35]:
df_correct_prediction.shape
Out[35]:
(295, 9)
In [36]:
# Sorted list of model names, reused for the color mapping below
models = sorted(df['model'].unique())

print(models)
['bert-large', 'bert-medium', 'deberta', 'distilbert-cased-model', 'distilbert-uncased-model', 'roberta-base-model']
In [37]:
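# Summing the boolean column counts the correct predictions per model and label.
# The first argument of sum() is numeric_only, so it is passed explicitly here.
score_by_label = df_correct_prediction.groupby(["model", "label"]).sum(numeric_only=True)
score_by_label = score_by_label.sort_values(by="correctly_predicted").reset_index()

score_by_model = score_by_label.groupby("model").sum(numeric_only=True).sort_values(by="correctly_predicted")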
score_by_model
Out[37]:
                          score    percentage  correctly_predicted  occurence
model
distilbert-cased-model    2557.03  146.987787  39                   39
distilbert-uncased-model  2195.60  90.754293   41                   41
bert-medium               1213.33  43.869068   47                   47
roberta-base-model        1035.87  40.455861   49                   49
deberta                   1942.50  105.797038  57                   57
bert-large                2117.84  102.528483  62                   62
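
The counts in this table work because summing a boolean column treats every `True` as 1. A minimal sketch of the same two-level aggregation on made-up rows (the `toy` frame and the output column names `correct` and `total_score` are my own, not part of the dataset):

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from df_merged.csv.
toy = pd.DataFrame({
    "model": ["bert-large", "bert-large", "deberta", "deberta", "deberta"],
    "label": ["Date", "Budget", "Date", "Date", "Budget"],
    "score": [80.0, 60.0, 70.0, 50.0, 90.0],
    "correctly_predicted": [True, True, True, True, True],
})

# First level: one row per (model, label). True counts as 1 when summed.
per_label = (
    toy.groupby(["model", "label"], as_index=False)
       .agg(correct=("correctly_predicted", "sum"), total_score=("score", "sum"))
)

# Second level: collapse the labels to get one row per model.
per_model = per_label.groupby("model")[["correct", "total_score"]].sum()
print(per_model)
```

Naming the aggregated columns explicitly also avoids carrying along columns that have no meaningful sum.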
In [38]:
x, y, w, h = 0, 0, 800, 450

blocks_by_model = treemap(score_by_model, "correctly_predicted", x, y, w, h)

dfs = []
for _, row in blocks_by_model.iterrows():
    # Lay out each model's labels inside that model's block.
    # Reading the columns by name avoids overwriting x and y defined above.
    df_score = score_by_label[score_by_label.model == row["model"]]
    dfs.append(treemap(df_score, "correctly_predicted", row["x"], row["y"], row["dx"], row["dy"], N=10))
blocks = pd.concat(dfs)

p = figure(width=w, height=h, tooltips="@label", toolbar_location=None,
           x_axis_location=None, y_axis_location=None)
p.x_range.range_padding = p.y_range.range_padding = 0
p.grid.grid_line_color = None

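# MediumContrast6 provides one color per model; a 4-color palette would leave two of the six models on the fallback color
p.block('x', 'y', 'dx', 'dy', source=blocks, line_width=1, line_color="white",
        fill_alpha=0.8, fill_color=factor_cmap("model", "MediumContrast6", models))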

p.text('x', 'y', x_offset=2, text="model", source=blocks_by_model,
       text_font_size="18pt",  text_color="white")

blocks["ytop"] = blocks.y + blocks.dy
p.text('x', 'ytop', x_offset=2, y_offset=2, text="label", source=blocks,
       text_font_size="6pt", text_baseline="top",
       text_color=factor_cmap("model", ("black", "white", "black", "white","black", "white"), models))

show(p)

4. Exploratory Data Analysis using Plotly¶

After seeing how hard it was to set up a treemap in Bokeh compared to Plotly, I chose to do the rest of my visualizations in Plotly because it is easier to use.

4.1. Treemaps¶

4.1.1. Visualizing the Number of Occurrences¶

I will visualize how many times each model predicts the correct answer for each label, to see which model performs best overall.

In [39]:
fig = px.treemap(df_correct_prediction, path=[px.Constant("Lunch Form"), 'label', 'model', 'actual_answer'], values='occurence', title="Correct-answer occurrences per label for each model")
fig.update_traces(root_color="lightgrey", marker=dict(cornerradius=5))
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
In [40]:
fig = px.treemap(df_correct_prediction, path=[px.Constant("Lunch Form"), 'model', 'label'], values='occurence', title="Correct-answer occurrences per model for each form label")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

4.1.2. Visualizing the Total Confidence Score¶

I will visualize the total confidence score of each model's correct predictions for each label, to see which model performs best overall.

In [41]:
fig = px.treemap(df_correct_prediction, path=[px.Constant("Lunch Form"), 'label', 'model', 'actual_answer'], values='score', title="Prediction confidence score per label for each model")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
In [42]:
fig = px.treemap(df_correct_prediction, path=[px.Constant("Lunch Form"), 'model', 'label'], values='score', title="Prediction confidence score per model for each form label")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
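
When reading the summed confidence scores, it helps to remember that a sum grows with the number of rows as well as with the confidence itself. As a rough sketch, dividing the totals from the score_by_model table by the number of correct predictions gives a mean score per correct answer, which separates the two effects. Only the numbers are taken from the table; the variable names are my own, and whether the mean is the better measure depends on the question being asked.

```python
# Totals copied from the score_by_model table: (summed score, correct predictions).
totals = {
    "distilbert-cased-model": (2557.03, 39),
    "distilbert-uncased-model": (2195.60, 41),
    "bert-medium": (1213.33, 47),
    "roberta-base-model": (1035.87, 49),
    "deberta": (1942.50, 57),
    "bert-large": (2117.84, 62),
}

# Mean confidence score per correct prediction.
mean_score = {model: round(score / n, 2) for model, (score, n) in totals.items()}

# Print the models from highest to lowest mean score.
for model, value in sorted(mean_score.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:26s} {value:6.2f}")
```

In the table, distilbert-cased-model has the highest summed score but the fewest correct predictions, so the treemaps sized by score and by occurrence can rank the models differently.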

4.2. PieCharts¶

4.2.1 Basic Pie Chart¶

In [17]:
fig = px.pie(df_correct_prediction, values='occurence', names='label',width=1000, height=500, color_discrete_sequence=px.colors.sequential.RdBu, title='Occurrences of Correctly Predicted Answers')
fig.show()

4.2.2. Pie chart in Dash¶

In [46]:
app = Dash(__name__)


app.layout = html.Div([
    html.H4('Analysis of the question answering models performances'),
    dcc.Graph(id="graph"),
    html.P("Names:"),
    dcc.Dropdown(id='names',
        options=['label', 'model', 'questions'],
        value='model', clearable=False
    ),
    html.P("Values:"),
    dcc.Dropdown(id='values',
        options=['score', 'percentage', 'occurence'],
        value='score', clearable=False
    ),
])
In [47]:
@app.callback(
    Output("graph", "figure"), 
    Input("names", "value"), 
    Input("values", "value"))
def generate_chart(names, values):
    # Redraw the donut chart whenever either dropdown changes
    fig = px.pie(df, values=values, names=names, hole=.3)
    return fig

if __name__ == '__main__':
    # app.run() replaces app.run_server(), which is deprecated in recent Dash releases
    app.run()
    # app.run(debug=True)
Dash is running on http://127.0.0.1:8050/

 * Serving Flask app 'Data-Visualization-Exercise' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:8050/ (Press CTRL+C to quit)

4.3. Dot Plot¶

Dot plots (also known as Cleveland dot plots) are scatter plots with one categorical axis and one continuous axis. They can be used to show changes between two (or more) points in time or between two (or more) conditions. Compared to a bar chart, dot plots can be less cluttered and allow for an easier comparison between conditions.

In [43]:
fig = px.scatter(score_by_label.sort_values('model'), y="label", x="correctly_predicted", color="model", symbol="model", 
             title='Number of correctly predicted answers per label for each model')
fig.update_traces(marker_size=10)
fig.show()

4.4. Horizontal Bar Charts¶

4.4.1. Visualizing the Number of Occurrences¶

I will visualize how many times each model predicts the correct answer for each label, to see which model performs best overall, this time using a horizontal bar chart.

In [44]:
fig = px.bar(score_by_label.sort_values('model'), x="correctly_predicted", y="label", color='model', orientation='h',
             hover_data=["correctly_predicted", "score"],
             height=400,
             title='Number of correctly predicted answers per label for each model')
fig.show()

4.4.2. Visualizing the Total Confidence Score¶

I will visualize the total confidence score of each model's correct predictions for each label, to see which model performs best overall, this time using a horizontal bar chart.

In [48]:
fig = px.bar(score_by_label.sort_values('model'), x="score", y="label", color='model', orientation='h',
             hover_data=["correctly_predicted", "score"],
             height=400,
             title='Sum of prediction confidence scores per label for each model')
fig.show()

4.5. Sunburst Charts¶

Sunburst plots visualize hierarchical data spanning outwards radially from root to leaves. Similar to Icicle charts and Treemaps, the hierarchy is defined by labels (names for px.sunburst) and parents attributes. The root starts from the center and children are added to the outer rings.
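
To illustrate what "defined by labels and parents" means, the sketch below builds those lists by hand from a few made-up (label, model, count) rows. Lists in this shape are what `plotly.graph_objects.Sunburst` accepts as `labels`, `parents`, and `values`; because the parent values here are totals of their children, `branchvalues="total"` would be the matching setting. The rows and names are invented for illustration, and `px.sunburst` with a `path` does this bookkeeping for you.

```python
# Made-up (label, model, count) rows for illustration.
rows = [
    ("Date", "bert-large", 12),
    ("Date", "deberta", 10),
    ("Budget", "bert-large", 5),
]

labels, parents, values = [], [], []

# Inner ring: one sector per form label, attached to the root ("" means no parent).
label_totals = {}
for label, _, count in rows:
    label_totals[label] = label_totals.get(label, 0) + count
for label, total in label_totals.items():
    labels.append(label)
    parents.append("")
    values.append(total)

# Outer ring: one sector per (label, model) pair, attached to its label.
# The sector name combines both parts so that every entry in `labels` is unique.
for label, model, count in rows:
    labels.append(f"{label} / {model}")
    parents.append(label)
    values.append(count)

for entry in zip(labels, parents, values):
    print(entry)
```

Each entry names a sector, says which sector it hangs under, and gives its size, which is all the chart needs to draw the rings.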

4.5.1. Visualizing the Number of Occurrences¶

I will visualize how many times each model predicts the correct answer for each label, to see which model performs best overall.

In [49]:
fig = px.sunburst(score_by_label, path=['label', 'model'],width=1000, height=500, values='correctly_predicted',
             title='Number of correctly predicted answers per label for each model')
fig.show()

4.5.2. Visualizing the Total Confidence Score¶

I will visualize the total confidence score of each model's correct predictions for each label, to see which model performs best overall.

In [50]:
fig = px.sunburst(score_by_label, path=['label', 'model'], values='score',width=1000, height=500, title = 'Prediction confidence score per label for each model')
fig.show()

4.6. Icicle Charts¶

Icicle charts visualize hierarchical data using rectangular sectors that cascade from root to leaves in one of four directions: up, down, left, or right. Similar to Sunburst charts and Treemap charts, the hierarchy is defined by labels (names for px.icicle) and parents attributes. Click on a sector to zoom in or out, which also displays a pathbar at the top of the icicle. To zoom out, you can click the parent sector or the pathbar.

In [51]:
fig = px.icicle(score_by_label, path=[px.Constant("Lunch Form"), 'label', 'model'], values='correctly_predicted',
             title='Number of correctly predicted answers per label for each model')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

4.7. Patterned Charts¶

In [52]:
fig = px.area(score_by_label, x="label", y="correctly_predicted", color="model", pattern_shape="model",
             title='Number of correctly predicted answers per label for each model')
fig.show()

fig.write_html(r"C:\Users\victo\source\repos\Semester 7\JupyterLab\Data Visualization\file.html")

Conclusion¶

To conclude, this notebook shows different plots that present the collected data using both Bokeh and Plotly. During this exercise I not only learned how to use these two tools, but also found different ways to visualize my data that make interacting with it more user-friendly, which I personally like.

In [53]:
plotly.offline.init_notebook_mode()